human transcription
What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems
Mori, Kiyotada, Kawano, Seiya, Liu, Chaoran, Ishi, Carlos Toshinori, Contreras, Angel Fernando Garcia, Yoshino, Koichiro
Spoken dialogue systems (SDSs) use automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in an SDS is to appropriately recognize the information in user speech that is relevant to response generation. Examining human selective listening, the ability to focus on and attend to the important parts of a conversation during speech, will enable us to identify the ASR capabilities required for SDSs and to evaluate them. In this study, we experimentally confirmed selective listening by comparing the transcriptions humans produced while generating dialogue responses with reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening and can identify the gap in transcription ability between ASR systems and humans.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
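A rough sketch of the comparison the abstract above describes, under the assumption that selective listening shows up as words present in the reference transcription but dropped from the response-oriented one; the alignment below is a generic difflib word alignment, not the authors' procedure:

```python
# Sketch: locate words a listener omitted when transcribing only what
# mattered for their response. A hypothetical illustration of the
# paper's idea; the authors' actual alignment method is not specified here.
from difflib import SequenceMatcher

def omitted_words(reference: str, selective: str) -> list[str]:
    """Return reference words missing from the selective transcription."""
    ref_toks = reference.lower().split()
    sel_toks = selective.lower().split()
    matcher = SequenceMatcher(a=ref_toks, b=sel_toks, autojunk=False)
    omitted = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            omitted.extend(ref_toks[i1:i2])
    return omitted

reference = "well um i think the meeting should maybe start at ten tomorrow"
selective = "the meeting should start at ten tomorrow"
print(omitted_words(reference, selective))
# -> ['well', 'um', 'i', 'think', 'maybe']  (fillers and hedges dropped)
```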
Style-agnostic evaluation of ASR using multiple reference transcripts
McNamara, Quinten, Fernández, Miguel Ángel del Río, Bhandari, Nishchal, Ratajczak, Martin, Chen, Danny, Miller, Corey, Jetté, Migüel
Word error rate (WER) as a metric has a variety of limitations that have long plagued the field of speech recognition. Evaluation datasets suffer from varying style, formality, and the inherent ambiguity of the transcription task. In this work, we attempt to mitigate some of these differences by performing style-agnostic evaluation of ASR systems using multiple references transcribed under opposing style parameters. As a result, we find that existing WER reports likely significantly overestimate the number of contentful errors made by state-of-the-art ASR systems. In addition, we find our multi-reference method to be a useful mechanism for comparing the quality of ASR models that differ in the stylistic makeup of their training data and target task.
- North America > United States > Pennsylvania (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Germany > Saxony > Leipzig (0.04)
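One plausible minimal reduction of the idea above: score a hypothesis against several references transcribed under different style guides and keep the most favorable WER, so purely stylistic divergence is not penalized. The jiwer package does the WER computation; the min-over-references aggregation is an assumption, not necessarily the paper's exact method:

```python
# Sketch: score a hypothesis against several stylistic variants of the
# reference and keep the most favorable WER, so purely stylistic choices
# (verbatim fillers, numerals vs. words) are not counted as errors.
# Requires: pip install jiwer
import jiwer

def multi_reference_wer(references: list[str], hypothesis: str) -> float:
    """Minimum WER over all acceptable reference styles."""
    return min(jiwer.wer(ref, hypothesis) for ref in references)

references = [
    "i have got twenty five dollars",   # verbatim style, numbers written out
    "I've got $25.",                    # formatted, written style
]
hypothesis = "I've got $25."
print(multi_reference_wer(references, hypothesis))  # -> 0.0
```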
Developing an End-to-End Framework for Predicting the Social Communication Severity Scores of Children with Autism Spectrum Disorder
Mun, Jihyun, Kim, Sunhee, Chung, Minhwa
Autism Spectrum Disorder (ASD) is a lifelong condition that significantly influences an individual's communication abilities and social interactions. Early diagnosis and intervention are critical due to the profound impact of ASD's characteristic behaviors on foundational developmental stages. However, the limitations of standardized diagnostic tools necessitate the development of objective and precise diagnostic methodologies. This paper proposes an end-to-end framework for automatically predicting the social communication severity of children with ASD from raw speech data. The framework incorporates an automatic speech recognition model, fine-tuned with speech data from children with ASD, followed by fine-tuned pre-trained language models that generate a final prediction score. Achieving a Pearson correlation coefficient of 0.6566 with human-rated scores, the proposed method demonstrates its potential as an accessible and objective tool for ASD assessment.
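The framework's final evaluation step, correlating predicted severity scores with human ratings, might look like the following; the score arrays are made-up placeholders, not the paper's data:

```python
# Sketch: the evaluation step of the proposed pipeline -- correlating
# model-predicted severity scores with human-rated ones. The arrays
# here are made-up placeholders, not the paper's data.
from scipy.stats import pearsonr

human_scores = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0]   # clinician ratings (placeholder)
model_scores = [2.2, 3.1, 1.4, 3.8, 2.9, 2.7]   # pipeline predictions (placeholder)

r, p_value = pearsonr(human_scores, model_scores)
print(f"Pearson r = {r:.4f} (p = {p_value:.3f})")
```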
HTEC: Human Transcription Error Correction
Sun, Hanbo, Gao, Jian, Wu, Xiaomin, Fang, Anjie, Cao, Cheng, Du, Zheng
High-quality human transcription is essential for training and improving Automatic Speech Recognition (ASR) models. A recent study (LibriCrowd) found that every 1% increase in transcription word error rate (WER) raises the WER of ASR models trained on those transcriptions by approximately 2%. Transcription errors are inevitable even for highly trained annotators. However, few studies have explored human transcription correction, and error correction methods for related problems, such as ASR error correction and grammatical error correction, do not perform well enough on this task. We therefore propose HTEC, for Human Transcription Error Correction. HTEC consists of two stages: Trans-Checker, an error detection model that predicts and masks erroneous words, and Trans-Filler, a sequence-to-sequence generative model that fills the masked positions. We propose a holistic list of correction operations, including four novel operations handling deletion errors, and a variant of embeddings that incorporates phoneme information into the transformer's input. HTEC outperforms other methods by a large margin and surpasses human annotators by 2.2% to 4.5% in WER. Finally, we deployed HTEC to assist human annotators and showed that it is particularly effective as a co-pilot, improving transcription quality by 15.1% without sacrificing transcription velocity.
- Oceania > Australia (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > New York (0.04)
- (4 more...)
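The two-stage shape of HTEC (detect and mask, then fill) can be approximated with off-the-shelf components. In the sketch below, a toy detector stands in for Trans-Checker and a generic BERT fill-mask pipeline stands in for Trans-Filler; neither is the authors' trained model:

```python
# Sketch of HTEC's two-stage shape: (1) a detector masks suspect words,
# (2) a generative model fills the masks. An off-the-shelf BERT
# fill-mask pipeline stands in for Trans-Filler, and hand-picked suspect
# indices stand in for Trans-Checker -- both are stand-ins, not the
# authors' trained models. Requires: pip install transformers torch
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def correct_transcript(words: list[str], suspect: set[int]) -> list[str]:
    """Mask each suspect position and replace it with the filler's top guess."""
    corrected = list(words)
    for i in sorted(suspect):
        masked = corrected[:i] + [fill.tokenizer.mask_token] + corrected[i + 1:]
        best = fill(" ".join(masked))[0]          # highest-scoring candidate
        corrected[i] = best["token_str"].strip()
    return corrected

words = "please weight for the next announcement".split()
print(correct_transcript(words, suspect={1}))    # index 1: "weight" -> likely "wait"
```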
TRScore: A Novel GPT-based Readability Scorer for ASR Segmentation and Punctuation model evaluation and selection
Behre, Piyush, Tan, Sharman, Shah, Amy, Kesavamoorthy, Harini, Chang, Shuangyu, Zuo, Fei, Basoglu, Chris, Pathak, Sayan
Punctuation and segmentation are key to readability in automatic speech recognition (ASR), yet they are often evaluated using F1 scores, which require high-quality human transcripts and do not reflect readability well. Human evaluation is expensive, time-consuming, and suffers from large inter-observer variability, especially for conversational speech devoid of strict grammatical structure. Large pre-trained models, however, capture a notion of grammatical structure. We present TRScore, a novel readability measure that uses the GPT model to evaluate different segmentation and punctuation systems. We validate our approach with human experts. Additionally, our approach enables quantitative assessment of the effect of text post-processing techniques, such as capitalization, inverse text normalization (ITN), and disfluency removal, on overall readability, which traditional word error rate (WER) and slot error rate (SER) metrics fail to capture. TRScore is strongly correlated with traditional F1 and human readability scores, with Pearson correlation coefficients of 0.67 and 0.98, respectively. It also eliminates the need for human transcriptions in model selection.
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.91)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.55)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
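The core mechanism above, prompting a GPT model to judge transcript readability, might be sketched as follows; the prompt wording, the 1-to-5 scale, and the choice of OpenAI's chat API are illustrative assumptions, not the paper's setup:

```python
# Sketch of the TRScore idea: ask a large language model to rate the
# readability of a punctuated, segmented transcript. The prompt, the
# 1-5 scale, the model name, and the use of OpenAI's chat API are
# assumptions for illustration -- the paper's actual GPT setup may differ.
# Requires: pip install openai, plus an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def readability_score(transcript: str) -> str:
    prompt = (
        "Rate the readability of this ASR transcript's punctuation and "
        "segmentation on a scale of 1 (unreadable) to 5 (perfectly "
        f"readable). Reply with the number only.\n\n{transcript}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

print(readability_score("so um we met at three. and then? we left early"))
```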
Who Decides if AI is Fair? The Labels Problem in Algorithmic Auditing
Mishra, Abhilash, Gorana, Yash
Labelled "ground truth" datasets are routinely used to evaluate and audit AI algorithms applied in high-stakes settings. However, there do not exist widely accepted benchmarks for the quality of labels in these datasets. We provide empirical evidence that quality of labels can significantly distort the results of algorithmic audits in real-world settings. Using data annotators typically hired by AI firms in India, we show that fidelity of the ground truth data can lead to spurious differences in performance of ASRs between urban and rural populations. After a rigorous, albeit expensive, label cleaning process, these disparities between groups disappear. Our findings highlight how trade-offs between label quality and data annotation costs can complicate algorithmic audits in practice. They also emphasize the need for development of consensus-driven, widely accepted benchmarks for label quality.
- Asia > India (0.29)
- North America > United States > Illinois > Cook County > Chicago (0.05)
- Oceania > Australia > New South Wales > Sydney (0.04)
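The audit's central measurement, per-group ASR error rates computed against noisy versus cleaned reference labels, can be sketched as follows; the two-sample dataset is fabricated purely to illustrate how reference quality alone can manufacture a disparity:

```python
# Sketch of the audit's core measurement: per-group WER computed once
# against noisy reference labels and once against cleaned ones. The
# tiny dataset is made up to show how reference quality alone can
# manufacture an urban/rural "disparity". Requires: pip install jiwer
import jiwer

def group_wer(samples: list[dict], ref_key: str) -> dict[str, float]:
    """Average WER per group against the chosen reference transcriptions."""
    by_group: dict[str, list[float]] = {}
    for s in samples:
        by_group.setdefault(s["group"], []).append(jiwer.wer(s[ref_key], s["hyp"]))
    return {g: sum(v) / len(v) for g, v in by_group.items()}

samples = [
    {"group": "urban", "hyp": "turn on the lights",
     "noisy_ref": "turn on the lights", "clean_ref": "turn on the lights"},
    {"group": "rural", "hyp": "turn on the lights",
     "noisy_ref": "turn on the light",  "clean_ref": "turn on the lights"},
]
print("noisy refs:", group_wer(samples, "noisy_ref"))   # rural looks worse
print("clean refs:", group_wer(samples, "clean_ref"))   # disparity disappears
```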
Facebook's latest AI can learn speech without human transcriptions
Speech recognition is an important cog in Big Tech's AI machinery. But despite its ubiquity, speech recognition is still a work in progress. Today, Facebook is heralding a major breakthrough in the way it trains these systems to learn new languages. The company says it has developed a method of building speech recognition tools that doesn't require transcribed data. Transcription is time-consuming: humans must listen to and transcribe hours of audio, a monotonous process that has to be repeated for each language. Facebook's "unsupervised" system, by contrast, learns purely from speech audio and unpaired text, giving it a sense of what human communication sounds like.
- North America (0.06)
- Europe (0.06)
- Asia > Kyrgyzstan (0.06)
Comparing Google's AI Speech Recognition To Human Captioning For Television News
Most television stations still rely on human transcription to generate the closed captioning for their live broadcasts. Yet even with the benefit of human fluency, this captioning can vary wildly in quality, even within the same broadcast, from a nearly flawless rendition to near-gibberish. At the same time, automatic speech recognition has historically struggled to achieve sufficient accuracy to entirely replace human transcription. Using a week of television news from the Internet Archive's Television News Archive, how does the station-provided, primarily human-created closed captioning compare with machine transcripts generated by Google's Cloud Speech-to-Text API? Automated high-quality captioning of live video is one of the holy grails of machine speech recognition. While machine captioning systems have improved dramatically over the years, a substantial gap has remained, holding them back from fully matching human accuracy.
- Media > Television (1.00)
- Leisure & Entertainment (1.00)
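The comparison pipeline the article describes might be sketched as follows, assuming a short WAV clip and the synchronous recognize() endpoint; the file path, audio parameters, and caption text are placeholders:

```python
# Sketch of the article's comparison: transcribe a broadcast clip with
# Google's Cloud Speech-to-Text, then score it against the station's
# closed captions with WER. The file path, encoding, and sample rate
# are placeholders; clips longer than ~1 minute need the long-running
# (asynchronous) API instead of recognize().
# Requires: pip install google-cloud-speech jiwer, plus GCP credentials.
import jiwer
from google.cloud import speech

client = speech.SpeechClient()

with open("broadcast_clip.wav", "rb") as f:      # placeholder clip
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
machine = " ".join(r.alternatives[0].transcript for r in response.results)

station_captions = "..."  # the human captioning for the same clip
print("WER vs. station captions:", jiwer.wer(station_captions, machine))
```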